Predicting Ultra Marathon Times

Final Project

James Adams

Background

Ultra marathons

  • An ultramarathon is any race longer than the standard marathon distance of ~42 kilometres (~26 miles)
  • 50 km and 100 km are both World Athletics record distances, but some 100-mile (160 km) races are among the oldest and most prestigious events
  • Around 100 miles is typically the longest course distance raced in under 24 hours

Participation

Dataset

Data cleaning

  • The data contained a lot of NaNs
  • Some values were simply missing; others were data entry errors - e.g. runner ages of 0 or 133 years!
  • Investigated and filled or dropped as appropriate, e.g. using groupby on race_year_id to count ranked runners and fill in missing participation numbers:
# races wrongly recorded with zero participants
zero_participant_races = races[races.participants == 0].race_year_id.unique()

# count the runners actually ranked in each of those races
participant_counts = dict(
    rankings[rankings.race_year_id.isin(zero_participant_races)]
    .groupby('race_year_id').runner.count()
)

# overwrite the zero counts with the real ones
for k, v in participant_counts.items():
    detailed_results.loc[detailed_results.race_year_id == k, 'participants'] = v

DNFs

  • Many rows were missing time values because the runner did not finish (DNF)
  • DNFs were split from the main data and stored in a separate dataset for possible secondary analysis
# split finishers from non-finishers for possible secondary analysis
dnf = runners[runners['time_in_seconds'].isna()].copy()
results = runners[runners['time_in_seconds'].notna()].copy()

Problem statement



Can you predict the finishing time of a
given athlete profile for a given race?

Variables: runner gender

Variables: runner age

Variables: runner nationality

Variables: race characteristics

city distance elevation_gain elevation_loss aid_stations participants
0 Castleton 166.90 4520 -4520 10 150
1 Castleton 166.90 4520 -4520 10 150
2 Castleton 166.90 4520 -4520 10 150
3 Castleton 166.90 4520 -4520 10 150
4 Castleton 166.90 4520 -4520 10 150

Regression model

Dummy encoding

  • Categorical features needed to be encoded before they could be used in a regression model
data_dm = pd.get_dummies(data, drop_first=True)
time runner_age distance elevation_gain elevation_loss aid_stations participants runner_gender runner_nationality_AND runner_nationality_ARG ... city_Yibin city_Yichang city_Ystad city_Zagreb city_Zalesie city_Zhaotong city_Äkäslompolo city_Åsa city_Örebro city_İstanbul
0 95725.00 30 166.90 4520 -4520 10 150 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 97229.00 43 166.90 4520 -4520 10 150 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 103747.00 38 166.90 4520 -4520 10 150 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 111217.00 55 166.90 4520 -4520 10 150 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 117981.00 48 166.90 4520 -4520 10 150 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 430 columns
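On a toy frame, drop_first=True drops one level per categorical column, which avoids perfectly collinear dummies (a minimal illustration, not part of the original pipeline):

```python
import pandas as pd

# two gender categories -> one dummy column; 'M' becomes the implicit baseline
toy = pd.DataFrame({'runner_gender': ['M', 'W', 'M'], 'distance': [100, 160, 50]})
encoded = pd.get_dummies(toy, drop_first=True)
# columns are now: distance, runner_gender_W
```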

Test/train split & Cross validation

  • Instantiating a KFold and LinearRegression object
  • Using cross_validate with the training data to obtain model metrics
X = data_dm.drop(columns='time')
y = data_dm.time

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2022)

lr = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=2022)

val_scores = cross_validate(lr, X_train, y_train, cv=kf, scoring=('r2', 'neg_mean_squared_error'), return_train_score=True)

  ----- Cross Validation Results -----
  Train RMSE: 18669.24176487618
  Train RMSE as hours: 5.185900490243384
  Train R2: 0.7445142195526304
  Test RMSE: 6292082509.582213
  Test RMSE as hours: 1747800.6971061705
  Test R2: -56745276872.53794
  -------------------------------------
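The summary block above is not produced by cross_validate itself; a minimal helper that derives it from the returned score arrays might look like this (a sketch - the original formatting code is not shown):

```python
import numpy as np

def summarise_cv(val_scores):
    """Reduce a cross_validate result dict to mean RMSE/R2 metrics.

    cross_validate returns *negated* MSE, so the sign is flipped
    before taking the square root; seconds / 3600 gives hours.
    """
    out = {}
    for split in ('train', 'test'):
        rmse = np.sqrt(-val_scores[f'{split}_neg_mean_squared_error'].mean())
        out[f'{split}_rmse'] = rmse
        out[f'{split}_rmse_hours'] = rmse / 3600
        out[f'{split}_r2'] = val_scores[f'{split}_r2'].mean()
    return out
```

e.g. summarise_cv(val_scores)['test_rmse'] gives the figure printed above.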

Feature reduction

  • Dummy encoding left the data with so many features that manually inspecting coefficients to judge each one's impact on the model is impractical
  • It would be a shame to lose the information contained in the categorical variables
  • So I investigated an automated way to decide how many, and which, variables should be included

SelectKBest pipeline

  • Using a scikit-learn pipeline and SelectKBest to find the best variables to include in our model
  • Took a while!
# create dictionaries to store results
scores = {}
rmses = {}

# loop through all the possible numbers of variables included in the model,
# fit each one using a pipeline with SelectKBest and a Linear Regression model,
# and store the results in the dictionaries
for n in range(1, 430):
    lr_selected = make_pipeline(SelectKBest(f_regression, k=n), LinearRegression())
    lr_selected.fit(X_train, y_train)
    score = lr_selected.score(X_test, y_test)
    scores[str(n)] = score
    rmse = np.sqrt(metrics.mean_squared_error(y_test, lr_selected.predict(X_test)))
    rmses[str(n)] = rmse

SelectKBest results

  • 406 variables wins!
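With the scores and rmses dictionaries from the loop on the previous slide, the winner can be read off programmatically rather than by eye (a small sketch, assuming the stringified-k keys used above):

```python
def best_k(scores):
    """Return the feature count whose model scored highest (test R2)."""
    return int(max(scores, key=scores.get))
```

best_k(scores) gives 406 here; the equivalent lookup on the RMSE dictionary is min(rmses, key=rmses.get).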

Model

  • Now the new model can be cross-validated, trained, and tested on the holdout testing data
lr_model = make_pipeline(SelectKBest(f_regression, k=406), LinearRegression())

new_val_scores = cross_validate(lr_model, X_train, y_train, cv=kf, scoring=('r2', 'neg_mean_squared_error'), return_train_score=True)

lr_model.fit(X_train, y_train)

# Calculate R2
lr_model.score(X_test, y_test)
# Calculate RMSE
np.sqrt(metrics.mean_squared_error(y_test, lr_model.predict(X_test)))

  ----- Cross Validation Results -----
  Train RMSE: 18671.68085122933
  Train RMSE as hours: 5.186578014230369
  Train R2: 0.7444475485541255
  Test RMSE: 18780.716818056364
  Test RMSE as hours: 5.216865782793435
  Test R2: 0.7413891601348425
  -------------------------------------

  ----- Final Model Results -----
  R2: 0.74
  RMSE: 18952.05
  RMSE in hrs: 5.26
  -------------------------------

Compare to baseline

  • ~5 hours may be quite a wide margin for a race result - does our final model, at the very least, beat a baseline that simply predicts the mean finishing time?
lr_dummy = make_pipeline(SelectKBest(f_regression, k=406), DummyRegressor(strategy='mean'))
lr_dummy.fit(X_train, y_train)

  ----- Dummy Model Results -----
  Dummy R2: -0.0
  Dummy RMSE: 36876.06
  Dummy RMSE in hrs: 10.24
  -------------------------------

Making predictions

  • With all the dummy variables in the model, it would be difficult to manually enter an encoded row of data specifying an athlete's nationality or a race location
  • So I created a helper function that can be used to enter new data for predictions:
predict_new_runner(20, 0, "GBR", 155, 1000, 400, 10, 100, "Zagreb")

    ----- Predicted Outcome -----
    For a 20 year old male from GBR, running a 155 km race in Zagreb
    with an elevation gain of 600 ft, 100 other runners, and 10 aid stations.
    
    Predicted finishing time: 1 days 03:00:13.388959575
    -----------------------------
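The helper itself is not shown in the original; one plausible construction (names and argument order assumed from the call above) builds a one-row DataFrame of raw inputs, dummy-encodes it, and reindexes to the fitted model's feature columns so every unused dummy defaults to 0:

```python
import pandas as pd

def predict_new_runner_sketch(model, feature_columns, runner_age, runner_gender,
                              nationality, distance, elevation_gain, elevation_loss,
                              aid_stations, participants, city):
    # assemble one raw row of inputs
    row = pd.DataFrame([{
        'runner_age': runner_age, 'runner_gender': runner_gender,
        'runner_nationality': nationality, 'distance': distance,
        'elevation_gain': elevation_gain, 'elevation_loss': elevation_loss,
        'aid_stations': aid_stations, 'participants': participants, 'city': city,
    }])
    # dummy-encode, then align to the training columns; any dummy column
    # missing from this single row is filled with 0 by reindex
    encoded = pd.get_dummies(row).reindex(columns=feature_columns, fill_value=0)
    seconds = model.predict(encoded)[0]
    # a timedelta prints naturally as "1 days 03:00:13..."
    return pd.to_timedelta(seconds, unit='s')
```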
    

Bonus

Ridge Regression

  • After going through all of the previous steps, I discovered that Ridge regression seemed to resolve the wildly negative test R2 seen earlier, without the need for SelectKBest
from sklearn.linear_model import Ridge

# instantiate Ridge Regression object
rdg = Ridge()

# perform cross validation for Ridge Regression model
rdg_val_scores = cross_validate(rdg, X_train, y_train, cv=kf, scoring=('r2', 'neg_mean_squared_error'), return_train_score=True)

# fit and score model
rdg.fit(X_train, y_train)
rdg.score(X_test, y_test)

# enter new data for a prediction from Ridge model
rdg_predict_new_runner(20, 0, "GBR", 155, 1000, 400, 10, 100, "Zagreb")

  ----- Cross Validation Results -----
  Train RMSE: 18678.84421521424
  Train RMSE as hours: 5.1885678375595115
  Train R2: 0.7442513730579005
  Test RMSE: 18780.14212984383
  Test RMSE as hours: 5.216706147178841
  Test R2: 0.74140731798193
  -------------------------------------

    ----- Predicted Outcome -----
    For a 20 year old male from GBR, running a 155 km race in Zagreb
    with an elevation gain of 600 ft, 100 other runners, and 10 aid stations.
    
    Predicted finishing time: 1 days 02:54:09.802431778
    -----------------------------
    

Wrapping up

Conclusion

  • Yes, you can predict an ultra marathon finishing time for a given athlete profile, in a given race
  • The model accounts for ~74% of the variability in a finishing time
  • But with an RMSE of ~5 hours, it may not be particularly useful for elite athletes

If I had more time…

  • Investigated multicollinearity within the dataset
  • Further investigated scikit-learn pipelines for other steps such as regularisation
  • Investigated other regression models, such as ElasticNet
  • Used classification methods on the ‘DNF’ data to predict whether an athlete will even finish a given race
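As a taste of one of those next steps, ElasticNet blends the L2 shrinkage used by Ridge with an L1 penalty that can zero out uninformative dummy columns; ElasticNetCV tunes both the penalty strength and the mix by cross-validation. A sketch on synthetic data, since it was not run in this project:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# synthetic stand-in: 10 features, only the first two carry signal
rng = np.random.default_rng(2022)
X = rng.normal(size=(200, 10))
coef = np.array([5.0, 3.0] + [0.0] * 8)
y = X @ coef + rng.normal(scale=0.1, size=200)

# tune alpha and the L1/L2 mix by 5-fold cross-validation
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=2022)
enet.fit(X, y)
```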

Thank you for listening!